[TLDR] Think Only When You Need with Large Hybrid-Reasoning Models
Adaptive Hybrid-Reasoning in Large Language Models for Efficient Problem Solving
Authors: L. Jiang et al.
Published on Arxiv: 2025-05-20
Link: http://arxiv.org/abs/2505.14631v1
Institutions: Microsoft Research • Peking University
Keywords: Large Language Models, Large Reasoning Models, Hybrid Reasoning, Reinforcement Learning from Human Feedback, Efficient AI, Policy Optimization, System 1 and System 2 reasoning, Adaptive Reasoning, Qwen-2.5, DeepSeek-R1, Hybrid Fine-Tuning, Hybrid Group Policy Optimization, Hybrid Accuracy, Mathematical Reasoning, Code Generation, General Problem Solving
Recent advancements in Large Reasoning Models (LRMs) have greatly improved reasoning abilities over standard Large Language Models (LLMs), particularly for mathematics, programming, and complex problem-solving. However, excessive or unnecessary reasoning—especially on simple queries—introduces notable computational and latency costs, highlighting the need for more efficient and context-adaptive AI systems.
To address these efficiency challenges, the authors propose an adaptive hybrid-reasoning framework with several key contributions:
- Introduction of Large Hybrid-Reasoning Models (LHRMs), which flexibly choose between explicit ‘Thinking’ and direct ‘No-Thinking’ modes based on query context, mirroring human cognitive adaptability.
- A two-stage training pipeline combining Hybrid Fine-Tuning (HFT) on annotated data and Hybrid Group Policy Optimization (HGPO), a new reinforcement learning algorithm, enabling robust selection of the appropriate reasoning strategy.
- Presentation of Hybrid Accuracy (HAcc), a novel metric to measure a model’s ability to select the correct reasoning approach.
- Comprehensive experiments on the Qwen-2.5 series, evaluated against strong baselines across benchmarks in mathematics, coding, and general problem solving.
The experimental results clearly demonstrate the advantages of this approach:
- LHRMs consistently outperform both baseline LLMs and LRMs on diverse benchmarks, excelling in tasks demanding both straightforward replies and multi-step reasoning.
- On key tasks (e.g., AIME24 and Arena-Hard), LHRMs achieve substantial accuracy and efficiency gains, including a +13.6% improvement on AIME24 and +93.4% on Arena-Hard with the 7B parameter model.
- The models generalize well across domains, efficiently reducing unnecessary reasoning on simple queries, and thus improving latency and user experience.
- The HGPO framework enables stable and effective reinforcement learning from human feedback for hybrid reasoning policies.
In summary, the work paves the way for more efficient, human-aligned AI assistants by innovatively balancing direct answering and complex reasoning:
- LHRMs are the first models to robustly and adaptively combine explicit reasoning with direct responses for real-world tasks.
- HAcc offers a standard for evaluating context-sensitive reasoning in hybrid models.
- These advances promise more capable, user-friendly AI by strategically controlling when and how deep reasoning processes are invoked.
- Future work is suggested in further extending generalization and refining control over reasoning behavior.